CS 2980: Model-based Semantic Compression in Databases Project Report

Author

  • Ugur Çetintemel
Abstract

We are now in the big data era: advances in data collection and management technologies have led to very large databases. Domains such as medicine, biology, music, and the experimental sciences in general are characterized by long data sequences. Given the storage space that big data requires and the I/O needed to fetch it, data compression has become an effective way to reduce database size and improve query performance. Data compression can be lossless or lossy. Lossless compression aims to return exact query answers, whereas lossy compression aims to minimize the amount of data that must be kept; as a result, lossless compression requires more computation and more metadata to preserve the complete information. Moreover, providing an exact answer to a complex query over a large data-recording environment can take minutes or even hours, while users often search for patterns of behavior that they can only approximately describe rather than for specific values. Lossy compression has therefore become popular for analytical use. Semantic compression is one kind of lossy compression. Because it accounts for and exploits both the meanings and dynamic error ranges of individual attributes and the existing data dependencies and correlations among attributes in a table [3], it is very effective for compressing tabular data. In this project, columns are compressed based on their associations with other columns using predictive models. Weka is a comprehensive suite of Java class libraries that implement many machine learning and data mining algorithms, including many algorithms for classification and numeric prediction [7], which can be used to build predictive models for target columns and thereby achieve semantic compression. Inspired by semantic compression and Weka, this project implemented a system in which columns are compressed based on their associations with other columns using Weka models. Given a SQL query together with an error tolerance, the system predicts the result rather than reading it directly from the database, thereby achieving a substantial improvement in query speed.
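As a rough illustration of the approach described above, the sketch below shows how a Weka model might be trained to predict one target column from the remaining columns and how a prediction could be checked against a per-query error tolerance. The ARFF file name, the choice of the M5P regression tree, the use of the last column as the target, and the 0.05 tolerance are illustrative assumptions, not details taken from the report.

import weka.classifiers.Classifier;
import weka.classifiers.trees.M5P;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ColumnModelSketch {
    public static void main(String[] args) throws Exception {
        // Load the table; "table.arff" is a placeholder for an ARFF export of the relation.
        Instances data = new DataSource("table.arff").getDataSet();

        // Treat the last column as the target to be compressed away; the other
        // columns stay stored and act as predictors.
        data.setClassIndex(data.numAttributes() - 1);

        // M5P is one of Weka's numeric-prediction learners; the report does not
        // name a specific algorithm, so this choice is illustrative only.
        Classifier model = new M5P();
        model.buildClassifier(data);

        // At query time the target value is predicted from the stored columns
        // instead of being read from the database.
        Instance row = data.instance(0);
        double predicted = model.classifyInstance(row);
        double actual = row.classValue();

        // The query carries an error tolerance; the prediction is acceptable if
        // it falls within that bound of the true value.
        double tolerance = 0.05;
        System.out.printf("predicted=%.3f actual=%.3f withinTolerance=%b%n",
                predicted, actual, Math.abs(predicted - actual) <= tolerance);
    }
}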

Similar Articles

Column-based Database Semantic Compression and Prediction-based Query Optimization

Data compression is quite important nowadays. Because the amount of data is huge and redundant, compression is a necessity for data-centric systems. This project aims to build a system for (lossy) data compression and for prediction based on the compressed data in column-based databases. The reason it mainly targets column-based databases is that our compression strategy makes more sense...

Data Mining Meets Network Management: The NEMESIS Project

Modern communication networks generate large amounts of operational data, including traffic and utilization statistics and alarm/fault data at various levels of detail. These massive collections of network-management data can grow on the order of several terabytes per year, and typically hide “knowledge” that is crucial to some of the key tasks involved in effectively managing a communication n...

Analysis of User query refinement behavior based on semantic features: user log analysis of Ganj database (IranDoc)

Background and Aim: Information systems cannot be well designed or developed without a clear understanding of users' needs and of how they seek and evaluate information. This research was designed to analyze the query refinement behavior of users of Ganj (the Iranian Research Institute of Science and Technology database) via log analysis. Methods: The method of this research is log anal...

CSCI 2980 Project Report Data Migration from S-Store to BigDAWG

Since spring 2016, I have been working with Prof. Stan Zdonik on a project about data migration from S-Store to the BigDAWG polystore system. S-Store, which is built on top of H-Store, is the world's first transactional streaming database system. S-Store maintains all the transactional support of a traditional relational database while also supporting the stream processing that is needed in real-time ap...

A new classification method based on pairwise SVM for facial age estimation

This paper presents a practical algorithm for facial age estimation from frontal face images. Facial age estimation generally comprises two key steps: age image representation and age estimation. The anthropometric model used in this study includes the computation of eighteen craniofacial ratios and a new, accurate skin wrinkle analysis in the first step, and a pairwise binary support vector...

Publication date: 2014